Phonemic segmentation of narrative speech in human cerebral cortex

Speech processing requires extracting meaning from acoustic patterns using a set of intermediate representations based on a dynamic segmentation of the speech stream. Using whole-brain mapping obtained with fMRI, we investigate the locus of cortical phonemic processing not only for single phonemes but also for short combinations made of diphones and triphones. We find that phonemic processing areas are much larger than previously described: they include not only the classical areas in the dorsal superior temporal gyrus but also a larger region in the lateral temporal cortex where diphone features are best represented. These identified phonemic regions overlap with the lexical retrieval region, but we show that short word retrieval is not sufficient to explain the observed responses to diphones. Behavioral studies have shown that phonemic processing and lexical retrieval are intertwined. Here, we also identify candidate regions within the cortical speech network where this joint processing occurs.

Regularized linear regression was used to predict BOLD activity in individual voxels. Regression weights (w) were estimated for each voxel and each feature space. The linear regression used the stimulus features in the four time windows preceding the BOLD activity at time t; each time window is 2 s long, corresponding to the scanner TR (a sampling rate of 0.5 Hz). B. Voxelwise Model Validation: The quality of the models' predictions was assessed using one 10-min story not included during model estimation (the validation story). The BOLD response to this validation story (green) was compared to the prediction (magenta) to calculate the cross-validated R².

Before fitting models that capture particular phonemes or semantic meanings, the fraction of the BOLD response predicted by the baseline model was subtracted, in order to eliminate the BOLD responses explained by the mere presence versus absence of auditory speech stimuli. We then fitted a joint phonemic model with all phonemic features (single phonemes, diphones and triphones) and a joint phonemic-semantic model with all the phonemic and semantic features. Afterwards, we used variance partitioning to obtain the unique variance explained by each phonemic feature and by all possible combinations of these phonemic features (single phonemes + diphones, diphones + triphones, single phonemes + triphones, single phonemes + diphones + triphones). The unique variance explained by the semantic features is obtained by subtracting the variance explained by the joint phonemic model from that explained by the joint phonemic-semantic model.
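The variance partitioning step can be illustrated with a minimal sketch. It assumes that the cross-validated R² of a joint model fitted on every subset of feature spaces is already available, and it recovers each partition by standard set-theoretic inclusion-exclusion; this is a generic formulation and not necessarily the exact procedure used here. The function names, the dictionary layout and the R² values are hypothetical.

```python
from itertools import combinations


def nonempty_subsets(items):
    """Yield every non-empty subset of `items` as a frozenset."""
    items = sorted(items)
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def partition_variance(joint_r2, features):
    """Split explained variance into unique and shared parts.

    joint_r2 maps a frozenset of feature-space names to the R^2 of the
    joint model fitted on exactly those feature spaces (assumed input).
    Returns a dict mapping each non-empty subset S to the variance that is
    shared by all members of S and by no feature space outside S.
    """
    features = frozenset(features)

    def union_r2(subset):
        # An empty model explains no variance.
        return joint_r2[subset] if subset else 0.0

    parts = {}
    for subset in nonempty_subsets(features):
        rest = features - subset
        # Inclusion-exclusion over the members of `subset`, always keeping
        # the remaining feature spaces `rest` in the model.
        total = -union_r2(rest)
        for inner in nonempty_subsets(subset):
            sign = 1.0 if len(inner) % 2 == 1 else -1.0
            total += sign * union_r2(inner | rest)
        parts[subset] = total
    return parts


# Hypothetical R^2 values for one voxel (illustrative, not from the paper).
S, D, T = "single", "diphone", "triphone"
example_r2 = {
    frozenset({S}): 0.045,
    frozenset({D}): 0.085,
    frozenset({T}): 0.030,
    frozenset({S, D}): 0.095,
    frozenset({D, T}): 0.090,
    frozenset({S, T}): 0.060,
    frozenset({S, D, T}): 0.100,
}
example_parts = partition_variance(example_r2, [S, D, T])
# Unique diphone variance equals R2(S, D, T) - R2(S, T).
unique_diphone = example_parts[frozenset({D})]
```

For a single feature space the formula reduces to the subtraction described above: the R² of the full joint model minus the R² of the joint model without that feature space.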

Supplementary Figure 4 Validation by simulation: sensitivity to longer combinations of phonemes. To explore the limits of phonemic combinations that could be recovered from our dataset, we also simulated responses to higher-order combinations. In this figure (as in Supplementary Figure 3A), each subpanel shows the results obtained by applying nested VMs to sets of three putative voxels: one sensitive to the phoneme count, one to a given phonemic combination (single phoneme, diphone, triphone, tetraphone, pentaphone, or hexaphone) and one to the mixture of both. For each voxel (n=1), we repeated this procedure 30 times for the single phoneme simulation, 50 times for diphones, and 150 times for triphone-, tetraphone-, pentaphone- and hexaphone-based VMs to obtain the distribution of the prediction performance. As shown in the plot, given the measured SNR and about 2 hours of data, one cannot recover the signals of putative voxels sensitive to phonemic combinations beyond the triphone.

Supplementary Figure 6 Prediction performance of phonemic processing on flat maps for the remaining 10 subjects. Each column of the cortical flatmaps shows the significant prediction performance of the unique contributions from single phoneme, diphone, triphone, phoneme + diphone, diphone + triphone, phoneme + triphone and single phoneme + diphone + triphone features, obtained using variance partitioning for each subject. (Subject 5 is shown in Figure 3 of the main text.)

Supplementary Figure 7 Phonemic processing: hemispheric analysis. In order to examine whether the significant prediction performance from single phoneme, diphone and triphone features varied significantly between the left (box) and right (notched box) hemispheres across regions of interest, we fitted a linear mixed-effect model followed by a post-hoc Tukey test. The statistics are derived from the performance of significant phonemic voxels for each ROI (n = 0 to 1,973 voxels) over 11 independent subjects.
All tests are two-sided and corrected for multiple comparisons. The exact p values can be found in the Laterality of the phonemic representation and segmentation section. The boxplots are defined the same way as in Figure 4 in the manuscript. Grey dots show the data of individual subjects (*p < 0.05, **p < 0.01 and ***p < 0.001). In STG, STS and Broca's area, the diphone feature space explains significantly more variance than the single phoneme or triphone features, and more so in the left hemisphere than in the right hemisphere. This indicates that the diphone segmentation for phonemic processing is more prominent in the left hemisphere than in the right hemisphere.

Table 6). Panel A shows the average prediction performance of all the significant cerebral cortex voxels of 11 subjects (each dot represents one subject) from the diphone statistics model (orange), the diphone identity model (blue), and both models (green). It shows that the performance of the diphone statistics model is significantly lower than that of the diphone identity model (t(1) = −30.19, p < 2.2 × 10^−16). All tests are two-sided and corrected for multiple comparisons. The boxplots are defined the same way as in Figure 4 in the manuscript. Panel B shows the anatomical data for one example subject (S5). Blue voxels are better predicted by the diphone (content) model, and orange voxels are better predicted by the diphone statistics model. White voxels are equally well predicted by both models. This analysis shows that although diphone statistical properties can explain some of the variance of BOLD responses captured by the diphone model, the actual diphone identities, which are captured in the diphone model but not in the phonotactic probabilities, yield significant additional explanatory power.

There is no significant difference between the contributions of consonants and vowels to the single phoneme encoding. In addition, the "vc" and "cv" combinations contribute the most to the diphone encoding.
For both panels A and C, the statistics are derived from the performance of significant voxels (n = 2,409 to 11,668 voxels) over 11 independent subjects. The boxplots are defined the same way as in Figure 4 in the manuscript.

In order to examine whether the significant prediction performance from the phonemic model and the semantic features varies between the left (box) and right (notched box) hemispheres across regions of interest, we fitted a linear mixed-effect model followed by a post-hoc Tukey test. The statistics are derived from the performance of significant phonemic-semantic voxels for each ROI (n = 56 to 19,511 voxels) over 11 independent subjects. All tests are two-sided and corrected for multiple comparisons. The exact p values can be found in the Laterality of phonemic versus semantic representations section. The boxplots are defined the same way as in Figure 4 in the manuscript. In STG, the phonemic model predicts a significantly higher response in the right hemisphere, while it explains significantly more variance in STS of both hemispheres. In LTC and Broca's area, semantic features predict a significantly higher response in the left hemisphere, while they explain significantly more variance across the cerebral cortex of both hemispheres. This indicates that the difference between semantic and phonemic processing is more prominent in the right hemisphere in STG and in the left hemisphere in LTC and Broca's area, while this difference is more balanced in the PAC and STS of both hemispheres.

Figure 7). Green voxels are better predicted by the phonemic model, while red voxels are better predicted by the semantic model. Yellow voxels are predicted equally well by the phonemic and semantic models. These figures indicate two gradients of phonemic to semantic cortical representations: one in the temporal cortex and another in the IPFC.
To quantify the phonemic to semantic transition, the center of mass of the voxels in the temporal cortex and IPFC of both hemispheres with significant prediction performance from phonemic versus semantic features is shown for each subject. This indicates that there is a medial/lateral gradient for the phonemic to semantic transition in the temporal cortex of both hemispheres.

Panel A shows the average prediction performance (R²) across significant voxels obtained from the second order phonemic model (single phoneme + diphone: cyan), the third order phonemic model (single phoneme + diphone + triphone: yellow), diphone only (blue) and triphone only (orange). The boxplots are defined the same way as in Figure 4 in the manuscript. It reveals that the prediction performance of the diphone features (blue) is significantly higher than that of the triphone features (orange). This effect is observed both for analyses based on 2 sessions and on 5 sessions of data collection. When more data are used, there is a small difference in the additional prediction obtained from the triphone features after taking the diphone features into account, but our central result remains: the diphone features play a more important role than the triphone features in explaining the BOLD response in phonemic cortical regions. Panel B shows that the prediction performance of the diphone features is dominant in both the 2-session and the 5-session data. Blue voxels are better explained by the diphone features, while orange voxels are better explained by the triphone features. White voxels are equally well predicted by diphone and triphone features. The flatmap reveals more blue voxels than orange or white voxels irrespective of whether 2 or 5 sessions of data are used.

Supplementary Figure 16 Signal-to-noise ratio of the BOLD signal for each subject. The SNR for each voxel of each subject is obtained by computing the coefficient of determination between two repeats of the BOLD responses collected while the subject was listening to the same story. The plot shows the histogram of SNR for each subject (each color represents one subject). It shows that our simulations based on SNR = 1 are representative of the conditions for detecting phonemic feature sensitivity given our data size.
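The per-voxel measure described in this caption can be sketched as follows. This is a minimal illustration that assumes the coefficient of determination is computed as 1 − SS_res/SS_tot, treating one repeat as the prediction of the other; the caption does not say whether this form or a squared correlation was used. The function name and the toy time courses are hypothetical.

```python
def coefficient_of_determination(reference, other):
    """R^2 of `other` taken as a prediction of `reference`.

    Computed as 1 - SS_res / SS_tot (assumed definition).
    """
    if len(reference) != len(other):
        raise ValueError("both repeats must have the same number of time points")
    mean_ref = sum(reference) / len(reference)
    ss_tot = sum((r - mean_ref) ** 2 for r in reference)
    ss_res = sum((r - o) ** 2 for r, o in zip(reference, other))
    if ss_tot == 0:
        raise ValueError("reference response has zero variance")
    return 1.0 - ss_res / ss_tot


# Toy BOLD time courses for one voxel, two repeats of the same story
# (made-up numbers, for illustration only).
repeat_1 = [0.0, 1.0, 2.0, 1.0, 0.0]
repeat_2 = [0.1, 0.9, 2.1, 0.9, 0.1]
voxel_snr = coefficient_of_determination(repeat_1, repeat_2)
```

Applying the function to every voxel of a subject and collecting the values in a histogram would give one curve of the kind shown in the figure.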
Supplementary Table 1 Table of prediction performance of phonemic processing in each ROI for each subject and statistics based on subject averages. The additive contribution of single phonemes, diphones and triphones to the prediction performance obtained with the Phonemic VM is shown for each subject for the entire cortex, Broca's area and the ROIs in the temporal cortex. The averages and standard errors estimated across subjects are shown in the last two rows. These data show that diphone prediction performance is significantly higher than that of single phonemes and triphones on average across the cerebral cortex (11/11 subjects), Broca's area (9/11 subjects), STG (11/11 subjects), STS (11/11 subjects) and LTC (11/11 subjects). This indicates that the most important phoneme-related representations in the brain occur at the level of diphones. All tests are two-sided. Yellow background: p < 0.05, Orange background: p < 0.01 and Red background: p < 0.001, obtained in a post-hoc pairwise t-test with Bonferroni correction.
Supplementary Table 2 Table of voxel counts and corresponding statistics of phonemic processing in each ROI for each subject based on subject averages with logistic regression. In order to quantify the number of voxels that are best explained by single phoneme, diphone or triphone features, we assigned each cortical voxel to its most predictive feature. Then, we quantified the effect size of the difference in the average number of voxels best explained by each feature for each subject. This effect size is calculated as the average of the negative logits of the probabilities of being best explained by the single phoneme and by the triphone, each relative to the probability of being best explained by the diphone. The averages and standard errors estimated across subjects are summarized as well. We performed a two-sided statistical analysis using mixed-effect multinomial logistic regression, which takes into account the varying number of voxels in each subject in its likelihood. This statistical analysis compares the observed probabilities of a voxel being best explained by the single phoneme, diphone or triphone contribution to what is expected by chance, given here by equal probabilities. We then use a likelihood ratio test comparing the mixed-effect model with fitted p_single and p_triphone to the equal-probability model p_single = p_triphone = 1/3. On average across the whole cerebral cortex, the additive prediction from the diphone features was consistently higher than that of the single phoneme or triphone features, with l = 1.802 ± 0.172 (2 SE) (χ²(2) = 1012.22, p < 2.2 × 10^−16, 11/11 subjects). Yellow background: p < 0.05, Orange: p < 0.01 and Red: p < 0.001, obtained in a post-hoc pairwise t-test with Bonferroni correction. In the temporal cortex, the probabilities were estimated using a multinomial mixed-effect statistical model with the subject as the random effect and the temporal cortex ROIs as a fixed effect (four levels: AC, STG, STS, and LTC).
This statistical model was compared to the mixed-effect multinomial model that did not include ROIs as a fixed effect but did include the two non-zero intercepts (yielding estimates for p_single and p_triphone distinct from 1/3) and subject as a random effect. This interaction was significant (χ²(6) = 227.88, p < 2.17 × 10^−46, 9/11 subjects) and showed a systematic increase in effect size from AC to STG and STS: the single phoneme, diphone and triphone-based models are equally good in AC (l = 0.391 ± 0.304 (2 SE), 4/11 subjects), but the diphone-based model becomes increasingly dominant in STG (l = 1.600 ± 0.229 (2 SE), 11/11 subjects) and STS (l = 2.423 ± 0.223 (2 SE), 11/11 subjects) and less so in LTC (l = 1.852 ± 0.236 (2 SE), 9/11 subjects). This indicates that the additive contribution of the diphone feature is higher than the contribution of single phonemes and triphones in STG and STS but not in AC or LTC. In Broca's area, the additive prediction from the diphone feature was consistently higher than that of the single phoneme or triphone feature, with l = 1.790 ± 0.305 (2 SE) (χ²(2) = 287.76, p < 2.2 × 10^−16, 9/11 subjects). In sum, these results are in line with those obtained giving equal weighting to all subjects and with the results obtained using the actual predicted values shown in Figure 4 of the main paper.
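The effect size l used in Supplementary Table 2 can be illustrated for a single subject with a short sketch. It assumes a plain winner-take-all count per feature and takes the diphone as the reference category; it does not reproduce the mixed-effect multinomial model itself. The function name and the voxel counts are hypothetical.

```python
import math


def diphone_effect_size(n_single, n_diphone, n_triphone):
    """Average negative logit of single phoneme and triphone vs. diphone.

    Each argument is the number of voxels best explained by that feature
    (winner-take-all assignment). Positive values mean that diphones win
    more voxels than the other two features on average.
    """
    counts = (n_single, n_diphone, n_triphone)
    if min(counts) <= 0:
        raise ValueError("all counts must be positive to form the logits")
    total = sum(counts)
    p_single, p_diphone, p_triphone = (c / total for c in counts)
    logit_single = math.log(p_single / p_diphone)
    logit_triphone = math.log(p_triphone / p_diphone)
    return -(logit_single + logit_triphone) / 2.0


# Hypothetical voxel counts for one subject (not taken from the table).
l_example = diphone_effect_size(n_single=100, n_diphone=400, n_triphone=100)
# Under the chance hypothesis p_single = p_diphone = p_triphone = 1/3, l is 0.
l_chance = diphone_effect_size(200, 200, 200)
```

With these made-up counts the diphone wins four times as many voxels as either alternative, so l equals log(4); the values reported in the table come from the fitted model rather than from raw counts like these.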
Supplementary Table 3 Table of prediction performance of each diphone category for each subject. The contribution of short words to the prediction is significantly higher than that of the beginning of words and of diphone residuals for the subjects with a red background in the table. The averages and standard errors estimated across subjects are shown in the last two rows. Yellow background: p < 0.05, Orange background: p < 0.01 and Red background: p < 0.001, obtained in a post-hoc pairwise two-sided t-test with Bonferroni correction.
Supplementary Table 4 Table of prediction performance of Phonemic-Semantic processing in each ROI for each subject and statistics based on subject averages. The table shows the mean prediction performance of each subject's BOLD response explained by the third order phonemic and semantic based VM for each ROI (whole cortex, AC, STG, STS, LTC, and Broca's area). The averages and standard errors of the mean performance estimated across subjects for each ROI are presented in the last two rows. These data show that semantic prediction performance is significantly higher than that of the phonemic features on average across the cerebral cortex (9/11 subjects), Broca's area (4/11 subjects), and LTC (5/11 subjects), while phonemic prediction performance is higher than that of the semantic features in STG (3/11 subjects) and STS (6/11 subjects). Yellow background: p < 0.05, Orange background: p < 0.01 and Red background: p < 0.001, obtained in a post-hoc pairwise two-sided t-test with Bonferroni correction.

Supplementary Table 5 Table of voxel counts and corresponding statistics of Phonemic-Semantic processing in each ROI for each subject based on subject averages with logistic regression. In order to quantify the number of voxels that are best explained by the phonemic or the semantic model, we assigned each cortical voxel to its most predictive feature. Then, we quantified the effect size of the difference by calculating the negative logit of the probability based on the number of voxels best explained by the phonemic features versus the additive contribution of the semantic feature in those significant voxels (winner-take-all count analysis). The averages and standard errors of the logits estimated across subjects for each ROI are presented in the last two rows.
Using the subject means and two SE, one could conclude that the semantic VM is more predictive in the whole cortex, in LTC and in Broca's area (just reaching the threshold of significance). To better model the statistical significance for these count data, we also used mixed-effect logistic regression, which incorporates the number of voxels in each subject in its likelihood. In the logistic regression, we statistically assessed whether the numbers of voxels best explained by the phonemic versus the semantic features differ from expectations based on the binomial distribution. The number of voxels best explained by semantic versus phonemic features was found to be significantly higher throughout the whole cortex, with the logit equal to l = 0.159 ± 0.106 (2 SE) (likelihood ratio test comparing the mixed-effect model with equal probability of semantic and third order phonemic voxels: χ²(1) = 6.61, p = 1.0 × 10^−2). The number of voxels best explained by the semantic based VM was also higher in Broca's area: l = 0.598 ± 0.266 (2 SE) (χ²(1) = 11.51, p = 6.9 × 10^−4). In order to further examine how different temporal cortical areas are involved in phonemic versus semantic processing, we used a generalized mixed-effect linear statistical model (glmer with family=binomial) with the temporal cortex ROIs (four levels: PAC, STG, STS, and LTC) as regressor, subject as the random effect and the fraction of voxels best explained by the semantic feature as the response variable. We found that the fraction of voxels best explained by the semantic based VM varied significantly across ROIs (likelihood ratio test against the nested model that does not include ROIs, χ²(3) = 583.80, p < 2.2 × 10^−16).
Moreover, the number of voxels best predicted by the semantic feature is significantly higher in LTC, l = 0.244 ± 0.068 (2 SE) (χ²(1) = 19.21, p = 1.2 × 10^−5), but the differences are negligible in STG, l = −0.070 ± 0.160 (2 SE) (χ²(1) = 0.74, p = 3.9 × 10^−1), and STS, l = 0.293 ± 0.240 (2 SE) (χ²(1) = 4.78, p = 2.9 × 10^−2). These results are in line with those obtained giving equal weighting to all subjects and with the results obtained using the actual predicted values shown in Figure 6.

Supplementary Table 6 Table of definitions of the diphone phonotactic probability features. The table summarizes the name and content of each feature used for modeling diphone phonotactic probability. Phonotactic probabilities refer to the likelihood of occurrence of the sequence of sounds present in a given word [1,2]. The diphone probability average refers to the average likelihood of each diphone occurring in each position of a word. In these measures, the syllable stress placement of vowels can also be considered. For stressed calculations, identical vowel sounds are considered to be distinct phonemes depending on primary, secondary, or no-stress placement. In unstressed calculations, vowel sounds are considered to be single phoneme categories.
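Returning to the count statistics of Supplementary Table 5, the p values quoted for the likelihood ratio tests can be checked directly from the reported χ² statistics. The sketch below uses the closed-form survival functions of the χ² distribution for 1 and 2 degrees of freedom, and a raw-count version of the logit l; it does not refit the mixed-effect models. The function names and the voxel counts are hypothetical.

```python
import math


def chi2_survival(statistic, df):
    """P(X > statistic) for a chi-square variable with df = 1 or 2."""
    if statistic < 0:
        raise ValueError("a chi-square statistic cannot be negative")
    if df == 1:
        # A chi-square with 1 df is the square of a standard normal.
        return math.erfc(math.sqrt(statistic / 2.0))
    if df == 2:
        # A chi-square with 2 df is exponential with mean 2.
        return math.exp(-statistic / 2.0)
    raise NotImplementedError("only df = 1 and df = 2 are covered in this sketch")


def semantic_logit(n_semantic, n_phonemic):
    """Log odds of a voxel being won by the semantic vs. the phonemic features."""
    if n_semantic <= 0 or n_phonemic <= 0:
        raise ValueError("both counts must be positive")
    return math.log(n_semantic / n_phonemic)


# Chi-square statistics (1 df) as reported for Supplementary Table 5.
p_whole_cortex = chi2_survival(6.61, df=1)
p_broca = chi2_survival(11.51, df=1)
p_ltc = chi2_survival(19.21, df=1)

# Hypothetical counts (not from the table): equal counts give l = 0.
l_balanced = semantic_logit(500, 500)
```

A positive l in this convention means that more voxels are best explained by the semantic features, which matches the sign of the values reported for the whole cortex, Broca's area and LTC.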

Feature Construction
In order to explore how the properties of vowels and consonants contribute to the cerebral cortical encoding of phonemes and phonemic combinations, we created two features based on the consonant or vowel identity of each phoneme. The single-vc feature uses a one-hot coding matrix (dimension: [number of TRs, 2]) to encode whether a single phoneme is a consonant or a vowel. The diphone-vc feature (dimension: [number of TRs, 6]) encodes whether a diphone belongs to one of six consonant/vowel combinations: vowel-vowel (vv), vowel-consonant (vc), consonant-vowel (cv), consonant-consonant (cc), vowel-blank (vb) and consonant-blank (cb). We then fitted four nested VMs: the single vc model (using the single-vc feature), the first order vc model (using single-vc + single phoneme features), the first order vc + diphone vc model (using single-vc + single phoneme + diphone-vc features) and the second order vc model (using single-vc + single phoneme + diphone-vc + diphone features). The variance explained by the single-vc feature is obtained from the single vc model, while the additional variance explained by the diphone-vc feature is obtained by subtracting the explainable variance of the first order vc model from that of the first order vc + diphone vc model. In addition, the additional variance explained by the single phoneme identities beyond the vowel and consonant categories is obtained by subtracting the explainable variance of the single vc model from that of the first order vc model. Similarly, the variance explained by the diphone identities beyond the combinations of vowels and consonants is obtained by subtracting the explainable variance of the first order vc + diphone vc model from that of the second order vc model. Furthermore, in order to compare the relative contributions of vowels and consonants to the phoneme encoding, we examined the model weights (coefficients) of the single-vc (consonant or vowel) and diphone-vc (vv, vc, cv, cc, vb, cb) features.
The weights of each voxel were scaled by the prediction performance of that voxel in order to suppress random effects from noisy voxels.
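The construction of the single-vc and diphone-vc features can be sketched as follows. This is a minimal illustration under several assumptions that are not stated in the text: phonemes are transcribed in ARPAbet with optional stress digits, one-hot vectors are summed within each TR, all phonemes of a word are assigned to the TR containing the word onset, and the last phoneme of a word is paired with a blank. All names and the toy transcript are hypothetical.

```python
# ARPAbet vowel symbols (assumed transcription scheme).
ARPABET_VOWELS = {
    "AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
    "EY", "IH", "IY", "OW", "OY", "UH", "UW",
}
SINGLE_VC_COLUMNS = ["c", "v"]
DIPHONE_VC_COLUMNS = ["vv", "vc", "cv", "cc", "vb", "cb"]


def cv_class(phoneme):
    """Return 'v' for a vowel and 'c' for a consonant, ignoring stress digits."""
    return "v" if phoneme.rstrip("012") in ARPABET_VOWELS else "c"


def vc_features(words, n_tr, tr_seconds=2.0):
    """Build the [n_tr, 2] single-vc and [n_tr, 6] diphone-vc matrices.

    words is a list of (onset_in_seconds, [phonemes]) tuples.
    """
    single = [[0] * len(SINGLE_VC_COLUMNS) for _ in range(n_tr)]
    diphone = [[0] * len(DIPHONE_VC_COLUMNS) for _ in range(n_tr)]
    for onset, phonemes in words:
        row = int(onset // tr_seconds)
        if not 0 <= row < n_tr:
            continue  # word falls outside the scanned time range
        classes = [cv_class(p) for p in phonemes]
        for label in classes:
            single[row][SINGLE_VC_COLUMNS.index(label)] += 1
        # Pair each phoneme with its successor; the last one meets a blank.
        for first, second in zip(classes, classes[1:] + ["b"]):
            diphone[row][DIPHONE_VC_COLUMNS.index(first + second)] += 1
    return single, diphone


# Toy transcript: "cat" starting at 0.3 s and "I" starting at 2.5 s.
toy_words = [(0.3, ["K", "AE1", "T"]), (2.5, ["AY1"])]
single_vc, diphone_vc = vc_features(toy_words, n_tr=2)
```

In the toy transcript, "cat" contributes one cv, one vc and one cb diphone to the first TR, and "I" contributes one vb diphone to the second TR. The resulting matrices would then be delayed and entered into the nested VMs in the same way as the other feature spaces.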
4 Supplementary Notes

4.1 Laterality of the phonemic representation and segmentation